Unsupervised learning on illicit trade data

Alice Lépissier

📧 alepissier@bren.ucsb.edu

Bren School of Environmental Science and Management

Policy and media attention on illicit financial flows (IFF) has increased, with the recognition that Africa is a net creditor to the world.

NYT-2013 Guardian-2015
New York Times (2013) Guardian (2015)
Guardian-2017 Economist-2019
Guardian (2017) Economist (2019)

What is trade mis-invoicing?

  • The deliberate mis-statement of price or quantity of internationally traded goods in invoices presented to customs
  • Can occur at import or export
  • Can result in an inflow or outflow of money

Motivations for trade mis-invoicing include:

  • Evading tariffs
  • Exploiting subsidy regimes
  • Subverting forex and capital controls
  • Hiding transfers of wealth

Mechanisms of mis-invoicing

conceptual-model

Why does trade mis-invoicing matter?

  • Outflows undermine the fiscal base and public spending
  • Financing gap needed to meet the Sustainable Development Goals (SDGs)
  • Combating trade mis-invoicing is crucial for the mobilization of domestic resources in the continent, and can catalyze sustainable development
governance-loop
Governance loop (credit: William Davis)

Data source

  • Trade mis-invoicing data-set of the United Nations Economic Commission for Africa
  • Panel with $n=6,248,254$ of mis-invoiced trade between 179 reporting jurisdictions and 179 partner countries for years 2000-2018 and disaggregated commodities
  • Citation: Lépissier, Alice, Davis, William, & Ibrahim, Gamal. (2019). Trade Mis-Invoicing Dataset (Version 1). DOI
  • The entire data-set panel_results.Rdata with ~6 million observations is available online.
  • The unit of observation is a reporter-partner-commodity-year tuple, where the data represent the illicit flow embedded in a transaction between a reporter $i$ and a partner $j$ for a commodity $c$ in year $t$.
  • The full panel is ~2GB. This notebook works with smaller summary data-sets generated from the full panel. The code to generate these summary data-sets is available on the repository https://github.com/walice/Trade-IFF by running the script file Compute IFF Estimates.R.

Methodology for calculating mis-invoiced trade

  • We locate mis-invoicing in the discrepancies between reported trade flows and their mirrored statistics
  • But not all observed discrepancies are due to illicit motives!
  • Imports tend to be recorded on the basis of Cost of Insurance and Freight (CIF), while exports are recorded Free On Board (FOB)

Our approach

  1. Estimate the discrepancies between imports and mirror net exports as a function of both licit and illicit predictors
  2. Perform a harmonization procedure in order to generate a reconciled value that represents our best estimate of what the legitimate value of the trade should be on a FOB basis
  3. Calculate the IFF embedded in each transaction as the difference between the observed value and the reconciled value
  • The panel can be aggregated over several dimensions, e.g. partner country, commodity, year. There are two methods to aggregate up the illicit flow:
    • Net basis: simply add up all negative and positive values, so that inflows and outflows cancel each other out
    • Gross Excluding Reversals (GER) basis: ignore all inflows, and sum up the positive values only across trading partners

Zoom in on Africa

  • During 2000-2018, the continent lost on average \$83 billion a year in gross illicit outflows
  • Net cumulative flows during that period were \$362 billion

Africa-map Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

Goals¶

This project will apply the following techniques to the data:

  1. Dimension reduction using Principal Components Analysis (PCA)
  2. Clustering
  3. Graph analysis

Data wrangling¶

  1. Mis-invoiced trade for countries (aggregated data)
    • View 1: unit of observation = reporter-year; features = sectors
    • View 2: unit of observation = sector-year; features = reporters
  2. Bilateral matrix of mis-invoiced trade (dyadic data)
    • Unit of observation = reporter-partner-year

Mis-invoiced trade for countries by sectors

In [8]:
# Extract mis-invoicing in imports
IFF_Sector_Imp = IFF_Sector[['section', 'Imp_IFF']]
IFF_Sector_Imp
Out[8]:
section Imp_IFF
reporter.ISO year
DZA 2001 Pulp of Wood or of Other Fibrous Material 3.928938e+05
2001 Textiles 2.245219e+06
2001 Stone, Glass, and Ceramics 6.595719e+04
2001 Base Metals 6.278751e+07
2001 Machinery and Electrical 4.591851e+07
... ... ... ...
ZWE 2016 Mineral Products 0.000000e+00
2016 Chemicals and Allied Industries 0.000000e+00
2016 Plastics and Rubbers 0.000000e+00
2016 Raw Hides, Skins, Leather, and Furs 0.000000e+00
2016 Wood and Wood Products 0.000000e+00

7997 rows × 2 columns

Mis-invoiced trade for dyads

In [13]:
# Extract mis-invoicing in imports
IFF_Dest_Imp = IFF_Dest[['partner.ISO', 'Imp_IFF']]
IFF_Dest_Imp
Out[13]:
partner.ISO Imp_IFF
reporter.ISO year
DZA 2001 BEL 1.943304e+05
2001 CAN 7.026511e+05
2001 FRA 3.657772e+07
2001 DEU 8.026104e+06
2001 ITA 3.033231e+07
... ... ... ...
ZWE 2016 MUS 0.000000e+00
2016 MOZ 0.000000e+00
2016 NAM 0.000000e+00
2016 ZAF 0.000000e+00
2016 GBR 0.000000e+00

13030 rows × 2 columns

Metadata for countries

In [15]:
crosswalk = pd.read_excel(data_dir + 'crosswalk.xlsx').rename(columns={'Country': 'country'})
crosswalk.head()
Out[15]:
ISO3166.3 ISO3166.2 country UN_Region UN_Region_Code UN_Sub-region UN_Sub-region_Code UN_Intermediate_Region UN_Intermediate_Region_Code UN_M49_Code ... WB_Income_Group WB_Region WB_Lending_Category WB_Other IMF_Code OECD EU27 Arab League Centroid_Longitude Centroid_Latitude
0 ABW AW Aruba Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 533.0 ... HIC Latin America & Caribbean .. NaN NaN 0 0 0 -69.966667 12.5000
1 AFG AF Afghanistan Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 512.0 0 0 0 65.000000 33.0000
2 AFG AF Afghanistan, Islamic Republic of Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 512.0 0 0 0 65.000000 33.0000
3 AGO AO Angola Africa 2.0 Sub-Saharan Africa 202.0 Middle Africa 17.0 24.0 ... LMC Sub-Saharan Africa IBRD NaN 614.0 0 0 0 18.500000 -12.5000
4 AIA AI Anguila Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 660.0 ... NaN NaN NaN NaN NaN 0 0 0 -63.050000 18.2167

5 rows × 25 columns

Auxiliary functions¶

In [16]:
def create_features(data, values, features, obs):
    """
    Convert data-set in long format to wide and preserve information on year.
    
    data: {IFF_Sector_Imp, IFF_Dest_Imp, IFF_Dest_AFR, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    """
    features_data = data.pivot_table(values=values, 
                                     columns=features, 
                                     index=[obs, 'year'], 
                                     fill_value=0)
    return features_data
In [18]:
def biplot_PCA(features_data, nPC=2, firstPC=1, secondPC=2, obs='reporter.ISO', show_loadings=False):
    """
    Project the data in the 2-dimensional space spanned by 2 principal components
    chosen by the user, along with a bi-plot of the top 3 loadings per PC, and color observations
    by class label.

    Args:
        features_data: as Pandas dataframe, data-set of features
        nPC: number of principal components
        firstPC: integer denoting first principal component to plot in bi-plot
        secondPC: integer denoting second principal component to plot in bi-plot
        obs: string denoting index of class labels (in features_data)
        show_loadings: Boolean indicating whether PCA loadings should be displayed
    Returns:
        plot (interactive)
        pca_loadings (if show_loadings=True)
    """
        
    # Run PCA (standardize data beforehand)
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=nPC, random_state=234)
    princ_comp = pca.fit_transform(features_data_std)

    # Extract PCA loadings
    cols = ['PC' + str(c+1) for c in np.arange(nPC)]
    pca_loadings = pd.DataFrame(pca.components_.T, 
                                columns=cols,
                                index=list(features_data.columns))
   
    # Extract PCA scores
    pca_scores = pd.DataFrame(princ_comp, 
                              columns=cols)
    pca_scores[obs] = features_data.reset_index()[obs].values.tolist()
    pca_scores['year'] = features_data.reset_index()['year'].values.tolist()
    
    score_PC1 = princ_comp[:,firstPC-1]
    score_PC2 = princ_comp[:,secondPC-1]
    
    # Generate plot data
    if obs == 'reporter.ISO':
        plot_data = pd.merge(pca_scores, obs_info, on=[obs, 'year'])
        color_obs = 'reporter'
        tooltip_obs = ['reporter', 'year', 'Income group (World Bank)', 'Country status (UN)']
    else:
        plot_data = pca_scores
        color_obs = 'section'
        tooltip_obs = ['section', 'year']

    # Return chosen PCs to plot
    PC1 = 'PC'+str(firstPC)
    PC2 = 'PC'+str(secondPC)

    # Extract top loadings (in absolute value)
    # TO DO: use dict to iterate over
    toploadings_PC1 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC1).tail(3)[[PC1, PC2]]
    toploadings_PC2 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC2).tail(3)[[PC1, PC2]]

    originsPC1 = pd.DataFrame({'index':toploadings_PC1.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    originsPC2 = pd.DataFrame({'index':toploadings_PC2.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    
    toploadings_PC1 = pd.concat([toploadings_PC1.reset_index(), originsPC1], axis=0)
    toploadings_PC2 = pd.concat([toploadings_PC2.reset_index(), originsPC2], axis=0)

    toploadings_PC1[PC1] = toploadings_PC1[PC1]*max(score_PC1)*1.5
    toploadings_PC1[PC2] = toploadings_PC1[PC2]*max(score_PC2)*1.5
    toploadings_PC2[PC1] = toploadings_PC2[PC1]*max(score_PC1)*1.5
    toploadings_PC2[PC2] = toploadings_PC2[PC2]*max(score_PC2)*1.5
    
    # Project top 3 loadings over the space spanned by 2 principal components
    lines = alt.Chart().mark_line().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0,1], [toploadings_PC1, toploadings_PC2]):
        lines[i] = alt.Chart(dataset).mark_line(color=color).encode(
        x= PC1 +':Q',
        y= PC2 +':Q',
        detail='index'
    ).properties(
        width=400,
        height=400
    )
    
    # Add labels to the loadings
    text=alt.Chart().mark_text().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0, 1], [toploadings_PC1[0:3], toploadings_PC2[0:3]]):
        text[i] = alt.Chart(dataset).mark_text(
                align='left',
                baseline='bottom',
                color=color
            ).encode(
                x= PC1 +':Q',
                y= PC2 +':Q',
                text='index'
            )
    
    # Scatter plot colored by observation class label
    points = alt.Chart(plot_data).mark_circle(size=60).encode(
        x=alt.X(PC1, axis=alt.Axis(title='Principal Component ' + str(firstPC))),
        y=alt.X(PC2, axis=alt.Axis(title='Principal Component ' + str(secondPC))),
        color=alt.Color(color_obs, scale=alt.Scale(scheme='category20b'),
                       legend=alt.Legend(orient='right')),
        tooltip=tooltip_obs
    ).interactive()
    
    # Bind it all together
    chart = (points + lines[0] + lines[1] + text[0] + text[1])    
    chart.display()

    if show_loadings:
        return pca_loadings
In [19]:
def scree_plot(features_data, show_explained_var=False):
    """
    Create a cumulative scree splot and (optional) return the explained variance by each component.
    
    features_data: as Pandas dataframe, the data-set on which to run PCA
    show_explained_var: as Boolean, flag for whether to return explained variance
    """
    
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=features_data_std.shape[1], random_state=234)
    princ_comp = pca.fit_transform(features_data_std)
    
    explained_var = pd.DataFrame({'PC': np.arange(1,features_data_std.shape[1]+1),
                                  'var': pca.explained_variance_ratio_,
                                  'cumvar': np.cumsum(pca.explained_variance_ratio_)})

    # Adapted from https://altair-viz.github.io/gallery/multiline_tooltip.html
    # Create a selection that chooses the nearest point & selects based on x-value
    nearest = alt.selection(type='single', nearest=True, on='mouseover',
                            fields=['PC'], empty='none')

    # The basic line
    line = alt.Chart(explained_var).mark_line(interpolate='basis', color='#FDE725FF').encode(
        alt.X('PC:Q',
            scale=alt.Scale(domain=(1, len(explained_var))),
            axis=alt.Axis(title='Principal Component')
        ),
        alt.Y('cumvar:Q',
            scale=alt.Scale(domain=(min(explained_var['cumvar']), 1)),
            axis=alt.Axis(title='Cumulative Variance Explained')
        ),
    )

    # Transparent selectors across the chart. This is what tells us
    # the x-value of the cursor
    selectors = alt.Chart(explained_var).mark_point().encode(
        x='PC:Q',
        opacity=alt.value(0),
    ).add_selection(
        nearest
    )

    # Draw points on the line, and highlight based on selection
    points = line.mark_point().encode(
        opacity=alt.condition(nearest, alt.value(1), alt.value(0))
    )

    # Draw text labels near the points, and highlight based on selection
    text = line.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(nearest, 'cumvar:Q', alt.value(' '))
    )

    # Draw a rule at the location of the selection
    rules = alt.Chart(explained_var).mark_rule(color='gray').encode(
        x='PC:Q',
    ).transform_filter(
        nearest
    )

    # Put the five layers into a chart and bind the data
    out = alt.layer(
        line, selectors, points, rules, text
    ).properties(
        title='Cumulative scree plot',
        width=500, height=300
    )
    
    out.display()
    
    if show_explained_var:
        return explained_var[['PC', 'var']]

PCA on feature space (for individual reporting countries)¶

Sector features¶

In [20]:
sector_features = create_features(IFF_Sector_Imp, 'Imp_IFF', 
                                  features='section', obs='reporter.ISO')
sector_features
Out[20]:
section Animal and Animal Products Animal or Vegetable Fats and Oils Arms and Ammunition Base Metals Chemicals and Allied Industries Footwear and Headgear Machinery and Electrical Mineral Products Miscellaneous Manufactured Articles Pearls, Precious Stones and Metals ... Precision Instruments Prepared Foodstuffs Pulp of Wood or of Other Fibrous Material Raw Hides, Skins, Leather, and Furs Stone, Glass, and Ceramics Textiles Transportation Vegetable Products Wood and Wood Products Works of Art
reporter.ISO year
AGO 2009 405283.635122 0.0 0.0 0.000000e+00 0.000000 0.000000 0.000000e+00 6.478975e+10 0.000000e+00 0.000000 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000e+00 0.000000
2010 0.000000 0.0 0.0 0.000000e+00 0.000000 0.000000 0.000000e+00 1.089132e+08 0.000000e+00 0.000000 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 6.523912e+06 0.000000
2011 0.000000 0.0 0.0 0.000000e+00 0.000000 0.000000 0.000000e+00 4.979103e+07 0.000000e+00 0.000000 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000e+00 0.000000
2012 988896.085180 0.0 0.0 0.000000e+00 0.000000 0.000000 0.000000e+00 1.096141e+08 0.000000e+00 114308.458397 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000e+00 0.000000
2013 0.000000 0.0 0.0 0.000000e+00 0.000000 0.000000 0.000000e+00 8.278369e+09 0.000000e+00 1403.561562 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 160710.588778 4.570304e+05 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2010 0.000000 0.0 0.0 2.117949e+06 0.000000 21336.565533 2.187900e+07 2.485023e+05 2.003530e+05 0.000000 ... 1.512825e+05 2.722857e+06 885184.681001 6573.065447 3220.283332 3.654199e+05 0.000000 115744.570640 1.163885e+05 0.000000
2011 0.000000 0.0 0.0 2.103863e+05 51449.393074 3188.156311 2.992221e+06 5.961897e+02 1.781356e+06 81983.524642 ... 0.000000e+00 4.093239e+05 0.000000 0.000000 0.000000 1.509153e+06 42739.185874 92654.845178 8.244830e+03 0.000000
2012 28077.193801 0.0 0.0 0.000000e+00 691967.749784 15313.268493 1.141664e+06 0.000000e+00 0.000000e+00 10408.020845 ... 1.309742e+06 1.005986e+06 47131.210630 19094.670625 127122.501231 1.903756e+04 0.000000 472.873529 5.327086e+03 0.000000
2015 0.000000 0.0 0.0 1.891896e+05 0.000000 0.000000 2.854722e+06 0.000000e+00 1.149907e+05 524.457424 ... 3.132032e+05 0.000000e+00 0.000000 28.166053 41927.573088 4.022486e+04 0.000000 4823.936654 0.000000e+00 10676.236998
2016 0.000000 0.0 0.0 0.000000e+00 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 ... 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 2.917164e+04 0.000000 11269.992850 0.000000e+00 0.000000

524 rows × 21 columns

Biplots¶

In [21]:
biplot_PCA(sector_features, 10, 1, 2, obs='reporter.ISO', show_loadings=True)
Out[21]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Animal and Animal Products 0.044141 0.048773 -0.121741 0.544399 0.040610 -0.643977 0.000473 0.505470 0.054638 -0.035190
Animal or Vegetable Fats and Oils 0.214518 0.115642 -0.381610 -0.095933 0.060412 -0.121698 -0.368568 -0.120595 -0.317380 -0.130230
Arms and Ammunition 0.009187 0.022579 -0.147060 -0.109685 0.947797 0.121945 0.032564 0.140734 0.156666 0.006301
Base Metals 0.238549 0.340234 -0.000067 -0.041597 0.008475 0.014610 -0.032349 -0.071596 -0.200190 -0.078895
Chemicals and Allied Industries 0.310148 -0.108579 0.002072 -0.084835 -0.022663 -0.030663 -0.162864 0.017378 0.046766 -0.105442
Footwear and Headgear 0.202086 -0.238673 0.158438 0.010886 0.001304 0.110779 0.351932 0.193803 -0.262254 -0.393471
Machinery and Electrical 0.290633 -0.232806 0.046778 0.046587 0.015939 -0.014017 -0.017379 -0.089503 0.138964 0.021593
Mineral Products 0.018252 0.025131 -0.151784 0.539210 -0.088627 0.698698 -0.323536 0.263938 0.063020 -0.041053
Miscellaneous Manufactured Articles 0.268048 -0.298020 0.085533 -0.048268 -0.000114 0.033611 0.056886 0.138228 -0.023183 -0.117920
Pearls, Precious Stones and Metals 0.153250 0.218693 -0.048139 -0.293326 -0.198982 0.126598 0.322953 0.444930 0.402063 0.322633
Plastics and Rubbers 0.261378 0.107312 -0.077377 0.321131 0.037432 0.053096 0.252303 -0.247964 -0.049133 0.223128
Precision Instruments 0.280445 -0.284080 0.083358 0.013548 -0.002918 0.003709 0.005280 -0.007083 0.062086 -0.074729
Prepared Foodstuffs 0.249012 0.305186 0.117482 -0.022023 -0.035972 0.025713 -0.056742 -0.024273 0.037447 -0.011021
Pulp of Wood or of Other Fibrous Material 0.225198 0.004975 0.108018 0.018551 -0.017791 -0.086366 -0.244965 -0.216488 0.632290 -0.096357
Raw Hides, Skins, Leather, and Furs 0.201661 0.312847 0.419293 0.111343 0.107871 0.036372 0.041900 -0.007676 -0.138495 0.054031
Stone, Glass, and Ceramics 0.288800 -0.253051 0.109953 -0.063059 0.025759 0.010057 -0.065686 0.092159 -0.022267 -0.071433
Textiles 0.184290 0.383869 0.366187 0.046166 0.087820 -0.021864 -0.039055 0.032805 -0.116442 -0.011063
Transportation 0.207111 -0.089849 -0.274307 0.301293 0.039817 -0.030034 0.346167 -0.439570 0.056626 0.263488
Vegetable Products 0.221859 0.184995 -0.381117 -0.202256 -0.130341 -0.082924 -0.192246 0.042048 0.122115 -0.166703
Wood and Wood Products 0.203222 0.094272 -0.411860 -0.157135 -0.077164 0.118007 0.293533 0.172876 -0.198836 0.002669
Works of Art 0.163513 -0.248847 0.086265 -0.114210 0.023045 -0.045777 -0.348851 0.170067 -0.277262 0.720630

treemap-sectors Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

In [23]:
biplot_PCA(sector_features, 10, 5, 6, obs='reporter.ISO')

Explained variance¶

In [24]:
scree_plot(sector_features, show_explained_var=True)
Out[24]:
PC var
0 1 0.413912
1 2 0.146175
2 3 0.065264
3 4 0.051863
4 5 0.047923
5 6 0.046608
6 7 0.043570
7 8 0.041116
8 9 0.031854
9 10 0.028971
10 11 0.021008
11 12 0.016138
12 13 0.012833
13 14 0.008467
14 15 0.006607
15 16 0.005711
16 17 0.003583
17 18 0.002980
18 19 0.002576
19 20 0.001736
20 21 0.001106

Variance-stabilizing and normalizing transformations¶

In [25]:
# Plot distribution of illicit flow in each feature (i.e. sector)
fig, axes = joypy.joyplot(sector_features, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors');
In [29]:
sector_features_yeo = pd.DataFrame(sector_features_yeo,
                                   index=sector_features.index,
                                   columns=sector_features.columns)
fig, axes = joypy.joyplot(sector_features_yeo, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors (Yeo–Johnson transformation)');
In [31]:
biplot_PCA(sector_features_yeo, 10, 1, 2, obs='reporter.ISO')

Country features¶

In [33]:
country_features = create_features(IFF_Sector_Imp, 'Imp_IFF',
                                   features='reporter.ISO', obs='section')
country_features
Out[33]:
reporter.ISO AGO BDI BEN BFA BWA CAF CIV CMR COG COM ... STP SWZ SYC TGO TUN TZA UGA ZAF ZMB ZWE
section year
Animal and Animal Products 2000 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000e+00 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 114933.589753 5.090304e+05 0.000000 0.000000 1.186521e+07 0.000000 0.000000
2001 0.0 0.0 12675.995475 8538.745344 0.000000 0.0 8.794734e+06 0.000000 0.0 0.0 ... 0.0 0.0 934152.133155 71227.720658 0.000000e+00 0.000000 5106.331325 8.202192e+06 2846.627417 0.000000
2002 0.0 0.0 18124.665791 13.709625 0.000000 0.0 7.482212e+06 13174.589981 0.0 0.0 ... 0.0 0.0 0.000000 15836.437143 1.070810e+06 0.000000 0.000000 5.718737e+06 2433.249532 0.000000
2003 0.0 0.0 90296.516275 0.000000 0.000000 0.0 4.083103e+06 107593.683090 0.0 0.0 ... 0.0 0.0 0.000000 78589.919694 2.985026e+06 103211.082853 0.000000 8.732803e+06 18879.599237 0.000000
2004 0.0 0.0 0.000000 0.000000 1806.559996 0.0 9.068470e+06 7732.258657 0.0 0.0 ... 0.0 0.0 0.000000 0.000000 3.228611e+06 0.000000 0.000000 1.841239e+07 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Works of Art 2014 0.0 0.0 0.000000 17156.132291 12830.226241 0.0 3.858378e+04 70537.006058 0.0 0.0 ... 0.0 0.0 0.000000 0.000000 0.000000e+00 21317.092371 0.000000 2.128602e+07 0.000000 0.000000
2015 0.0 0.0 3273.105849 0.000000 2198.496563 0.0 2.794819e+05 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.000000 0.000000e+00 100330.822589 0.000000 5.479930e+06 158591.872202 10676.236998
2016 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000e+00 0.000000 0.0 0.0 ... 0.0 0.0 107378.779048 0.000000 0.000000e+00 347365.324137 0.000000 1.028624e+07 0.000000 0.000000
2017 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.000000e+00 0.000000 0.0 0.0 ... 0.0 0.0 1628.163632 0.000000 0.000000e+00 0.000000 0.000000 5.156807e+06 0.000000 0.000000
2018 0.0 0.0 0.000000 0.000000 5161.572298 0.0 2.174934e+05 0.000000 0.0 0.0 ... 0.0 0.0 0.000000 0.000000 0.000000e+00 0.000000 0.000000 8.229506e+06 124.678411 0.000000

388 rows × 44 columns

Biplots¶

In [34]:
biplot_PCA(country_features, 10, 1, 2, obs='section', show_loadings=True)
Out[34]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
AGO 0.017912 5.514075e-01 -6.595394e-02 0.025962 -2.231147e-02 -3.938764e-02 -5.258932e-02 -1.365304e-01 8.083240e-02 2.494126e-02
BDI 0.161808 -5.885873e-02 -2.142387e-01 -0.259797 -6.017636e-02 2.470629e-02 -1.663686e-01 -2.208025e-01 -1.119750e-01 -2.415305e-01
BEN 0.159161 -4.516412e-02 2.353180e-02 0.263605 3.624776e-01 -2.358708e-02 -2.624521e-01 -4.730997e-02 -6.956652e-02 -2.950970e-03
BFA 0.199627 -7.466810e-02 -1.517609e-01 0.292573 -5.319003e-02 1.185176e-01 -1.603284e-01 -8.790714e-02 3.385643e-02 1.430156e-01
BWA 0.226452 -2.470016e-02 1.964707e-01 -0.038896 3.186673e-02 9.754347e-02 2.581621e-02 -6.834754e-02 -6.786817e-02 2.845237e-02
CAF 0.060953 -2.423722e-02 6.145746e-02 0.019839 1.022512e-01 -2.300056e-01 9.429698e-02 -2.103033e-01 3.879913e-01 -2.200791e-01
CIV 0.105488 -1.896825e-02 -9.330936e-02 -0.109930 8.978703e-02 6.380665e-02 -1.438167e-01 -4.002698e-02 7.396497e-02 2.217393e-02
CMR 0.215935 1.471874e-02 -3.772548e-02 -0.239787 1.252240e-01 4.356172e-02 4.087396e-02 -1.242078e-01 -2.637905e-02 -2.965078e-01
COG 0.112733 1.666536e-01 2.915145e-03 -0.035490 1.528226e-01 5.178206e-02 3.567847e-02 3.004521e-01 -3.545424e-01 -1.809072e-01
COM 0.051127 7.197777e-03 1.756009e-02 0.122009 1.858534e-01 9.104870e-02 4.539034e-02 2.935408e-01 -1.857377e-01 -1.160032e-01
CPV 0.016607 5.534858e-02 -3.838267e-02 -0.142992 1.896416e-01 4.537723e-01 1.521892e-01 4.847682e-02 3.621027e-01 2.603169e-01
DZA 0.196923 -2.294360e-02 -1.548028e-01 -0.165103 -2.226779e-01 -1.516935e-01 -2.018801e-02 1.542668e-01 1.470740e-01 1.844884e-01
EGY 0.104280 1.378653e-01 -5.118140e-02 -0.169066 2.324012e-01 3.210950e-02 2.755167e-02 2.190735e-01 -2.158125e-01 2.199737e-01
ETH 0.158324 -4.014095e-02 -8.419988e-02 0.059786 -2.505009e-01 -7.751519e-02 -2.619006e-01 3.655624e-01 2.606315e-01 3.609116e-02
GAB 0.059861 -1.148785e-02 5.662378e-02 -0.102162 1.277980e-01 -3.641122e-01 3.614057e-01 -4.634173e-02 7.739867e-02 1.075931e-02
GHA 0.124902 4.418882e-03 1.478105e-01 0.183433 2.542247e-01 -1.003225e-01 1.144316e-01 -7.476554e-02 5.280047e-02 1.150774e-01
GIN 0.098123 -3.443361e-02 1.001610e-01 -0.320412 -5.070118e-02 -2.201624e-01 1.171149e-01 -3.761286e-02 -1.643017e-01 4.544630e-01
GMB 0.026534 -2.804263e-02 1.840854e-02 0.117042 2.080698e-01 6.417860e-02 -2.674997e-01 -1.200021e-01 -1.770794e-01 3.674214e-02
GNB 0.000000 8.271806e-25 5.082198e-21 0.000000 1.734723e-18 1.387779e-17 2.775558e-17 -5.551115e-17 5.551115e-17 -7.494005e-16
KEN 0.157485 1.608434e-02 2.924297e-01 -0.008968 3.180398e-02 1.970060e-02 3.639926e-02 -1.282441e-01 1.422254e-01 3.181977e-02
LSO 0.000974 1.201650e-03 3.446740e-02 0.020011 -4.914535e-02 1.189187e-02 1.419754e-02 6.500379e-02 -3.287577e-02 9.873186e-03
MAR 0.209989 9.801589e-02 -2.418021e-01 0.151462 -3.904627e-02 -1.255711e-01 1.322290e-01 1.142980e-01 1.172323e-01 -3.483941e-03
MDG 0.020075 -7.710883e-03 2.305018e-03 -0.031404 -2.126835e-02 -3.903643e-02 1.099263e-02 3.485184e-02 -1.220636e-02 7.459725e-02
MLI 0.223316 2.877725e-03 2.689983e-01 0.059863 -8.092077e-02 3.067764e-02 2.248653e-01 -6.047445e-02 -1.794675e-01 1.144377e-01
MOZ 0.070527 -3.420541e-03 -7.800596e-02 -0.110988 4.464858e-02 1.260780e-01 1.244225e-01 8.718813e-02 -8.948751e-04 -3.159536e-01
MRT 0.018405 5.526241e-01 -7.213094e-02 0.021217 -2.325910e-02 -5.317987e-02 -5.841704e-02 -1.159483e-01 5.676186e-02 3.066703e-02
MUS 0.115522 -2.574762e-02 6.388136e-02 -0.107589 6.100157e-02 -8.185087e-02 -2.549161e-02 -1.137196e-01 1.330623e-01 2.278600e-02
MWI 0.186959 1.665100e-02 3.445901e-01 0.110859 -1.685151e-01 1.346469e-01 -4.645074e-02 1.523103e-02 -6.798684e-02 -3.400650e-02
NAM 0.134673 -4.693020e-02 -2.149337e-01 -0.093922 2.017822e-01 1.538073e-01 -1.503298e-01 -7.454155e-02 2.832303e-02 1.997034e-01
NER 0.242242 -2.485852e-02 -3.380991e-02 0.163658 -2.412860e-01 2.891197e-02 1.482589e-02 1.893478e-01 1.459404e-01 -1.018770e-01
NGA 0.067717 7.075084e-02 4.424728e-02 -0.113234 1.622436e-01 4.486131e-01 2.405523e-01 1.120335e-01 2.811335e-01 -3.614680e-02
RWA 0.161312 -5.442593e-02 -3.241122e-01 0.118355 -1.425568e-01 1.451823e-01 2.532153e-01 -2.829166e-01 -1.829770e-01 -7.738377e-02
SDN 0.081731 5.369537e-01 4.777555e-02 0.047584 -5.916148e-02 -4.092852e-02 -5.137138e-02 -2.478842e-02 -6.413927e-02 -8.013745e-03
SEN 0.228531 -6.132647e-02 -1.272153e-01 -0.181528 1.714978e-01 -4.566210e-02 -1.463249e-01 -1.121548e-01 1.805564e-02 7.280521e-03
STP 0.079250 -5.119753e-02 -2.764644e-01 0.355536 -1.127582e-01 1.071384e-01 3.125278e-01 -2.375984e-01 -1.575535e-01 2.161396e-01
SWZ 0.004966 -1.508434e-02 -2.373776e-02 -0.006060 1.531431e-01 -5.435986e-02 -1.072552e-01 -6.064910e-02 -2.945053e-02 4.645084e-02
SYC 0.033172 -2.596771e-02 -6.809319e-02 0.217908 2.195726e-01 -2.140697e-01 -6.763963e-02 1.440052e-01 1.635484e-01 1.353497e-01
TGO 0.078850 -2.669243e-02 -8.186927e-02 0.131700 3.105691e-01 -2.783801e-01 2.653382e-01 1.303583e-01 2.484137e-02 -1.628648e-01
TUN 0.245546 -5.366217e-03 2.193663e-02 0.078355 -4.501813e-02 4.237581e-02 1.051398e-01 2.930481e-01 -6.585268e-02 7.821132e-02
TZA 0.251201 -7.327758e-03 -1.574792e-01 -0.108323 -5.830664e-02 -7.531258e-02 -1.496811e-02 1.063331e-01 -2.072387e-02 -2.071575e-01
UGA 0.249191 -2.314197e-02 1.489123e-01 -0.213065 -1.279806e-01 -5.609207e-02 -2.056086e-02 -1.449479e-02 -1.316680e-01 5.795158e-02
ZAF 0.269186 -4.844308e-02 -5.070727e-02 -0.015973 -2.835864e-02 -4.679396e-02 -3.673384e-03 -1.254192e-01 6.464041e-02 3.740865e-02
ZMB 0.228133 -4.934351e-02 1.969482e-01 0.097975 3.423265e-02 -6.979207e-03 -2.457654e-01 -8.606043e-02 5.312901e-02 1.113854e-02
ZWE 0.113996 1.247955e-02 3.180561e-01 0.148751 -9.851243e-02 9.580208e-02 3.608021e-02 -1.026538e-01 6.980560e-02 -1.909840e-01

percent-trade Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF) percent-GDP Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF)

Variance-stabilizing and normalizing transformations¶

In [45]:
biplot_PCA(country_features_log, 10, 1, 2, obs='section')
In [46]:
biplot_PCA(country_features_yeo, 10, 1, 2, obs='section')

PCA on bilateral trade matrix¶

In [47]:
partner_features = create_features(IFF_Dest_Imp, 'Imp_IFF', 
                                   features='partner.ISO', obs='reporter.ISO')
partner_features
Out[47]:
partner.ISO AGO ALB ARE ARG ARM AUS AUT AZE BDI BEL ... UGA UKR URY USA VEN VNM YEM ZAF ZMB ZWE
reporter.ISO year
AGO 2009 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.000000e+00 0.0 0.0 0.0 4.395286e+06 0.0 0.0
2010 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 8.584043e+07 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2011 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 2.490753e+06 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2012 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 114308.458397 ... 0.0 0.0 0.0 0.000000e+00 0.0 0.0 0.0 3.068458e+07 0.0 0.0
2013 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1403.561562 ... 0.0 0.0 0.0 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2010 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 30444.933040 ... 0.0 0.0 0.0 3.378385e+05 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2011 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 8.550872e+04 0.0 0.0 0.0 4.273919e+04 0.0 0.0
2012 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 1.040802e+04 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2015 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 1.548184e+05 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2016 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 2.917164e+04 0.0 0.0 0.0 0.000000e+00 0.0 0.0

524 rows × 135 columns

Biplots¶

In [48]:
biplot_PCA(partner_features, 10, 1, 2, show_loadings=True)
Out[48]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
AGO 0.072229 -0.059823 -0.022020 0.072610 -0.094362 -0.071972 -0.212337 0.093214 0.038881 -0.135991
ALB 0.023496 0.141048 0.065847 -0.149302 -0.068538 0.188247 -0.243690 -0.238979 -0.032977 -0.028917
ARE 0.090988 0.119464 -0.025234 0.260371 0.032576 -0.036729 -0.044910 -0.004393 -0.058071 0.028610
ARG 0.101247 0.057838 -0.059467 -0.033063 0.000191 -0.130590 0.061327 -0.152660 -0.043180 -0.052758
ARM 0.040794 -0.033612 0.118821 0.034115 0.000833 0.003838 0.030450 -0.054974 -0.025408 0.132202
... ... ... ... ... ... ... ... ... ... ...
VNM 0.115947 -0.081192 0.189902 0.124218 -0.152074 0.042535 -0.004353 0.027226 0.025113 0.016085
YEM 0.020365 0.124612 0.041676 -0.142488 -0.110078 -0.046202 -0.005136 -0.038015 -0.084409 -0.198996
ZAF 0.024976 0.124980 -0.060703 0.168502 0.079109 0.007476 0.031707 -0.006569 -0.038583 -0.018313
ZMB 0.011550 -0.025564 -0.022726 -0.000072 -0.024807 0.010710 0.006134 0.011464 -0.022610 0.068491
ZWE 0.062134 -0.058334 0.003793 -0.004270 -0.037508 0.120834 0.188857 -0.025397 -0.012800 0.055406

135 rows × 10 columns

top-destinations Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

Intra-African illicit financial flows¶

In [62]:
partner_features_AFR = create_features(IFF_Dest_Imp_AFR, 'Imp_IFF', 
                                       features='partner.ISO', obs='reporter.ISO')
partner_features_AFR
Out[62]:
partner.ISO AGO BDI BEN BFA BWA CAF CIV CMR COG COM ... STP SWZ SYC TGO TUN TZA UGA ZAF ZMB ZWE
reporter.ISO year
AGO 2009 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 4.395286e+06 0.0 0.0
2010 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2012 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 3.068458e+07 0.0 0.0
2013 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2014 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2010 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2011 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 4.273919e+04 0.0 0.0
2012 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2015 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0
2016 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 ... 0 0 0.0 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.0

477 rows × 40 columns

In [66]:
biplot_PCA(partner_features_AFR, partner_features_AFR.shape[1], 1, 2)

Removing outliers¶

In [68]:
partner_features_AFR_noout = partner_features_AFR[~outlying]
In [69]:
biplot_PCA(partner_features_AFR_noout, 40, 1, 2)

Coloring by class label¶

In [73]:
biplot_PCA_classes(sector_features, 21, 1, 2, classes='Income group (World Bank)')
In [74]:
biplot_PCA_classes(partner_features, 46, 1, 2)
In [78]:
biplot_PCA_classes(partner_features_AFR_noout, 40, 1, 2)

Clustering¶

Agglomerative clustering¶

In [81]:
# Perform agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(n_clusters=5, linkage='ward').fit(X)
plt.scatter(X[:, 0], X[:, 1], c=clustering.labels_);
plt.title('Agglomerative clustering using Ward linkage');

Hierarchical clustering and corresponding heatmap¶

In [89]:
# Plot heatmap and corresponding dendograms
g = sns.clustermap(data_scaled,
                   method='ward',
                   cmap='bone_r',
                   row_colors=region_color,
                  );
g.fig.suptitle('Hierarchical clustering with Ward linkage', y=1);

Network analysis¶

In [94]:
def create_graph_data(year=2018, threshold=True, threshold_var='GDP', flow_threshold=10000):
    """
    Returns data-set to be used to generate a directed graph.
    
    year: specify which year to use for the network analysis (between 2000-2018)
    threshold: as boolean, indicate whether to restrict reporting countries to conduits
    threshold_var: as string, indicate which variable to use when determining which 
        reporting countries are conduits ('GDP' or 'trade')
    flow_threshold: as numeric, specify the cut-off under which to ignore the dollar values
        of mis-invoiced imports
    """
    flow_data = IFF_Dest.reset_index().query('year == @year')
    flow_data = flow_data.loc[flow_data['Imp_IFF'].notnull(), :]
    flow_data = flow_data.query('Imp_IFF >= @flow_threshold')
    
    if threshold:
        conduits = 'conduits_' + threshold_var
        flow_data = flow_data[flow_data['reporter.ISO'].isin(eval(conduits).index)]
    return flow_data

Thresholding¶

In [95]:
# Import data on mis-invoiced trade aggregated by destination and sector for each African country
IFF_Year = pd.read_csv(results_dir + 'GER_Orig_Year_Africa.csv')

# Restrict data to 2018 as an illustrative example
IFF_Year = IFF_Year.query('year == 2018')

# Plot distribution of proportional mis-invoiced imports for each African country
sns.distplot(IFF_Year['Tot_IFF_GDP'].apply(lambda x: x*100), 
             kde=True, label='% GDP', bins=10);
sns.distplot(IFF_Year['Tot_IFF_trade'].apply(lambda x: x*100),
             kde=True, label='% trade', bins=10);
plt.legend()
plt.title('Distribution of IFF in Africa in 2018')
plt.xlabel('Mis-invoiced trade for African countries');
In [96]:
print('Mean outflow as proportion of GDP:',IFF_Year['Tot_IFF_GDP'].mean(),
      '\nMean outflow as proportion of trade', IFF_Year['Tot_IFF_trade'].var())
Mean outflow as proportion of GDP: 0.023895600645560334 
Mean outflow as proportion of trade 0.002387875707799794
In [97]:
thresh_GDP = 0.1
thresh_trade = 0.17
conduits_GDP = IFF_Year.query('Tot_IFF_GDP >= @thresh_GDP').set_index('reporter.ISO')
conduits_trade = IFF_Year.query('Tot_IFF_trade >= @thresh_trade').set_index('reporter.ISO')
In [98]:
conduits_GDP[['reporter', 'year', 'Tot_IFF_GDP', 'GDP']]
Out[98]:
reporter year Tot_IFF_GDP GDP
reporter.ISO
In [99]:
conduits_trade[['reporter', 'year', 'Tot_IFF_trade', 'Total_value']]
Out[99]:
reporter year Tot_IFF_trade Total_value
reporter.ISO
ZAF South Africa 2018 0.191945 189028379597
In [100]:
# Import data where value of illicit flow is standardized for partners
IFF_std = pd.read_csv(results_dir + 'GER_Orig_Dest_Year_std.csv')
IFF_std = IFF_std.query('year == 2018')

# Merge in with flow data
flow_data = pd.merge(left=create_graph_data(2018, threshold_var='GDP'),
                     right=IFF_std[['reporter.ISO', 'partner.ISO', 'pImp_IFF_GDP']],
                     on=['reporter.ISO', 'partner.ISO'])

# Plot distribution of IFF in partner countries
sns.distplot(flow_data['pImp_IFF_GDP'], kde=True);
plt.title('Distribution of IFF in partner countries in 2018 (dyad-level)')
plt.xlabel('Mis-invoiced trade as proportion of GDP in partner countries');
In [101]:
def threshold_partner_IFF(flow_data, year=2018, partner_threshold=0.0001):
    """
    Filters bilateral flow data-set to minimum level of illicit flow relative to partner GDP.
    
    flow_data: as Pandas dataframe, name of data-set which contains bilateral flow data
        (in wide format)
    year: as string, specify which year to use for the network analysis (between 2000-2018)
    partner_threshold: as numeric, specify the cut-off for the minimum proportion of partner
        GDP that a bilateral flow must represent in order to be included
    """  
    IFF_std = pd.read_csv(results_dir + 'GER_Orig_Dest_Year_std.csv')
    IFF_std = IFF_std.query('year == @year')
    flow_data = pd.merge(left=flow_data,
                     right=IFF_std[['reporter.ISO', 'partner.ISO', 'pImp_IFF_GDP']],
                     on=['reporter.ISO', 'partner.ISO'])
    flow_data = flow_data.query('pImp_IFF_GDP >= @partner_threshold')
    return flow_data

Directed graph for GDP¶

In [102]:
# Graph data for conduits relative to GDP
flow_data_GDP = create_graph_data(2018, threshold_var='GDP')
flow_data_GDP = threshold_partner_IFF(flow_data_GDP)
In [103]:
# Create directed graph
graph = nx.from_pandas_edgelist(flow_data_GDP,
                                'reporter.ISO',
                                'partner.ISO', 
                                'Imp_IFF',
                                create_using = nx.DiGraph())
In [107]:
def set_graph_attributes(graph):
    """
    Sets node attributes and create auxiliary variables to be used in graph visualization.
    
    Returns:
    col: list of values to color nodes according to (node GDP per capita)
    edge_col: array of values to color edges according to (logged flow between nodes)
    sizes: list of values to size nodes (proportional to degree)
    labels: dict of node labels (where a node is labelled if outdegree is at least 1)
    """
    # Create dictionary of GDP per capita for each country
    GDP_attr = covariates.loc[:, ['GDPpc']]
    GDP_attr = GDP_attr.to_dict('index')
    # Set GDP per capita as a node attribute
    nx.set_node_attributes(graph, GDP_attr)

    # Create list of colors for nodes in the graph
    col = [nx.get_node_attributes(graph, 'GDPpc')[n] for n in graph.nodes]

    # Create list of colors for edges in the graph
    edge_col = [nx.get_edge_attributes(graph, 'Imp_IFF')[e] for e in graph.edges]
    edge_col = np.log(edge_col)

    # Extract outdegree (the number of edges coming out of nodes) and degree
    outdeg = graph.out_degree
    deg = graph.degree
    # Size of nodes will be proportional to their degree
    sizes = [10 * deg[c] for c in graph.nodes]
    # Label the countries if their outdegree is at least 1, i.e. if they are the reporting African countries
    labels = {c: c if outdeg[c] >= 1 else ''
              for c in graph.nodes}
    return col, edge_col, outdeg, deg, sizes, labels
In [108]:
# Generate graph attributes and auxiliary variables
col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)
In [109]:
# Draw directed graph and color nodes by GDP per capita
plt.figure(figsize = (10,8))

pos = nx.spring_layout(graph)
nodes = nx.draw_networkx_nodes(graph, pos,
                               node_color=col, cmap=plt.cm.spring_r)
edges = nx.draw_networkx_edges(graph, pos,
                               edge_color=edge_col, 
                               edge_cmap=plt.cm.get_cmap('RdPu'))
thous_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
nx.draw_networkx_labels(graph, pos, font_size=10)
clb = plt.colorbar(nodes, format=thous_fmt)
clb.ax.set_title('GDP per capita');
In [110]:
# Create dictionary of longitude and latitude for countries
geo_pos = {country: (v['Centroid_Longitude'], v['Centroid_Latitude'])
           for country, v in
           crosswalk.drop_duplicates('ISO3166.3').set_index('ISO3166.3').to_dict('index').items()
}
list(geo_pos.items())[0:10]
Out[110]:
[('ABW', (-69.966667, 12.5)),
 ('AFG', (65.0, 33.0)),
 ('AGO', (18.5, -12.5)),
 ('AIA', (-63.05, 18.2167)),
 ('ALA', (19.952, 60.198)),
 ('ALB', (20.0, 41.0)),
 ('AND', (1.5, 42.5)),
 ('ANT', (-68.87, 12.123)),
 ('ARE', (54.0, 24.0)),
 ('ARG', (-64.0, -34.0))]
In [111]:
# Map projection
crs = ccrs.PlateCarree()
fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
ax.coastlines(color='lightgray')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.BORDERS, edgecolor='white')
# Uncomment for the map to span the whole world
# ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

# Overlay network graph
nx.draw_networkx(graph, ax=ax,
                 pos=geo_pos,
                 font_size=14,
                 alpha=0.7,
                 node_color=col, cmap=plt.cm.spring_r,
                 edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                 node_size=sizes,
                 labels=labels)

Directed graph for trade¶

In [112]:
# Graph data for conduits relative to trade
flow_data_trade = create_graph_data(2018, threshold_var='trade')
flow_data_trade = threshold_partner_IFF(flow_data_trade)
In [113]:
# Create directed graph
graph = nx.from_pandas_edgelist(flow_data_trade,
                                'reporter.ISO',
                                'partner.ISO', 
                                'Imp_IFF',
                                create_using = nx.DiGraph())
In [115]:
# Generate graph attributes and auxiliary variables
col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)
In [116]:
# Draw directed graph and color nodes by GDP per capita
plt.figure(figsize = (10,8))

pos = nx.spring_layout(graph)
nodes = nx.draw_networkx_nodes(graph, pos,
                               node_color=col, cmap=plt.cm.spring_r)
edges = nx.draw_networkx_edges(graph, pos,
                               edge_color=edge_col, 
                               edge_cmap=plt.cm.get_cmap('RdPu'))
thous_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
nx.draw_networkx_labels(graph, pos, font_size=10)
clb = plt.colorbar(nodes, format=thous_fmt)
clb.ax.set_title('GDP per capita');
In [117]:
# Map projection
crs = ccrs.PlateCarree()
fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
ax.coastlines(color='lightgray')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.BORDERS, edgecolor='white')
# Uncomment for the map to span the whole world
# ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

# Overlay network graph
nx.draw_networkx(graph, ax=ax,
                 pos=geo_pos,
                 font_size=14,
                 alpha=0.7,
                 node_color=col, cmap=plt.cm.spring_r,
                 edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                 node_size=sizes,
                 labels=labels)

Evolution of the network during 2000-2018¶

Conclusion¶

Policy recommendations

  • Target 16.4 of the SDGs: "By 2030, significantly reduce illicit financial and arms flows, strengthen the recovery and return of stolen assets and combat all forms of organized crime"
  • Estimates of the magnitude and severity of the problem have shed a spotlight on the need to combat illicit financial flows
  • Deeper understanding of the origins and destinations is now crucial to guiding targeted interventions to combat illicit flows
  • Strong subregional effects are apparent in illicit trade within the continent. Initiatives to combat illicit flows could be anchored within the Regional Economic Communities of the African Union, such as ECOWAS.

Next steps

I plan to take this project further. Notably, I plan to conduct analysis on the most disaggregated view of the data, that is, for a reporter-partner-year-commodity tuple. By filtering by commodity sector, I will be able to identify the relevant sinks and sources.

Moreover, I am currently exploring spectral clustering. However, spectral clustering algorithms are currently implemented for undirected rather than directed graphs. There are several possible approaches to dealing with clustering on directed graphs. One of them includes a naive approach where direction is ignored, and where the graph is treated as an undirected network.

Finally, I am considering non-negative matrix factorization as an alternative form of dimension reduction.